Chapter 2 Bias

2.1 Example: Different level of analysis

The maps demonstrate the geographic gradients of obesity and diabetes rates in the state of Utah and New York. The left panel is estimates at the census tract level, while the right panel is estimates at the county level.

## <br><br>
tmap_arrange(UT_Diabetes_tract, UT_Diabetes_county, ncol=2, outer.margins = 0.02)
cat("<br><br>")
## <br><br>
tmap_arrange(NY_Obesity_tract, NY_Obesity_county, ncol=2, outer.margins = 0.02)
cat("<br><br>")
## <br><br>
tmap_arrange(NY_Diabetes_tract, NY_Diabetes_county, ncol=2, outer.margins = 0.02)
cat("<br><br>")
## <br><br>

Here is a summary table.

2.2 Bias due to omitted confounders

\[y_i = \beta_0 + \beta_1 x_i + \beta_2 x_2 + \dots + \epsilon_i; \;\; for \;\; i=1, \dots, n\]

where the errors \(\epsilon_i \sim N(0, \sigma^2)\) with independent and identically distributed (i.i.d.)

Let’s assume the following association is true (i.e., gold standard) without any selection bias, measurement bias, and other unmeasured confoundings.

N <- 100000
C <- rnorm(N)
X <- .5 * C + rnorm(N)
Y <- .3 * C + .4 * X + rnorm(N)

2.2.1 Gold standard

With the correct model specification (i.e., \(C\) as a confounder), we get an unbiased estimate of \(X\) on \(Y\).

# Gold standard
glm.unbiased <- glm(Y~X + C, family="gaussian")
summary(glm.unbiased)
## 
## Call:
## glm(formula = Y ~ X + C, family = "gaussian")
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.001275   0.003167   0.403    0.687    
## X           0.401728   0.003155 127.319   <2e-16 ***
## C           0.305416   0.003545  86.161   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 1.002752)
## 
##     Null deviance: 142238  on 99999  degrees of freedom
## Residual deviance: 100272  on 99997  degrees of freedom
## AIC: 284068
## 
## Number of Fisher Scoring iterations: 2

2.2.2 Misspecified model: a confounder, \(C\), was omitted from the model

By omitting \(C\), the estimate of \(X\) was biased either “away from” or “towards to” the null

# C was omitted
glm.unbiased <- glm(Y~X, family="gaussian")
summary(glm.unbiased)
## 
## Call:
## glm(formula = Y ~ X, family = "gaussian")
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.0006198  0.0032821   0.189     0.85    
## X           0.5234680  0.0029241 179.019   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 1.077185)
## 
##     Null deviance: 142238  on 99999  degrees of freedom
## Residual deviance: 107716  on 99998  degrees of freedom
## AIC: 291227
## 
## Number of Fisher Scoring iterations: 2

2.2.3 Bias “away from” or “towards” the null?

N <- 100000
C <- rnorm(N)
X <- -.5 * C + rnorm(N)
Y <- -.3 * C + .4 * X + rnorm(N)
# C was omitted
glm.unbiased <- glm(Y~X + C, family="gaussian")
summary(glm.unbiased)
## 
## Call:
## glm(formula = Y ~ X + C, family = "gaussian")
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.001155   0.003162   0.365    0.715    
## X            0.405403   0.003144 128.939   <2e-16 ***
## C           -0.297054   0.003532 -84.099   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.9997488)
## 
##     Null deviance: 141635  on 99999  degrees of freedom
## Residual deviance:  99972  on 99997  degrees of freedom
## AIC: 283768
## 
## Number of Fisher Scoring iterations: 2
glm.unbiased <- glm(Y~X, family="gaussian")
summary(glm.unbiased)
## 
## Call:
## glm(formula = Y ~ X, family = "gaussian")
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.001605   0.003272   0.491    0.624    
## X           0.523397   0.002912 179.766   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 1.070449)
## 
##     Null deviance: 141635  on 99999  degrees of freedom
## Residual deviance: 107043  on 99998  degrees of freedom
## AIC: 290600
## 
## Number of Fisher Scoring iterations: 2

2.2.4 A \(C\) is not a confounder on \(X\) and \(Y\)

N <- 100000
C <- rnorm(N)
X <- rnorm(N)
Y <- .4 * X + rnorm(N)

2.2.5 Correct model specification: Without \(C\)

glm.unbiased <- glm(Y~X, family="gaussian")
summary(glm.unbiased)
## 
## Call:
## glm(formula = Y ~ X, family = "gaussian")
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.007034   0.003166   2.222   0.0263 *  
## X           0.404101   0.003176 127.239   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 1.002215)
## 
##     Null deviance: 116445  on 99999  degrees of freedom
## Residual deviance: 100220  on 99998  degrees of freedom
## AIC: 284013
## 
## Number of Fisher Scoring iterations: 2

2.2.6 Misspecified model with \(C\)

glm.unbiased <- glm(Y~X + C, family="gaussian")
summary(glm.unbiased)
## 
## Call:
## glm(formula = Y ~ X + C, family = "gaussian")
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.007039   0.003166   2.223   0.0262 *  
## X            0.404118   0.003176 127.243   <2e-16 ***
## C           -0.003170   0.003173  -0.999   0.3178    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 1.002215)
## 
##     Null deviance: 116445  on 99999  degrees of freedom
## Residual deviance: 100219  on 99997  degrees of freedom
## AIC: 284014
## 
## Number of Fisher Scoring iterations: 2

2.2.7 A \(C\) is a colloder on \(X\) and \(Y\)

N <- 100000
X <- rnorm(N)
Y <- .7 * X + rnorm(N)
C <- 1.2 * X + .6 * Y + rnorm(N)

2.2.8 Correct model specification: Without \(C\)

glm.unbiased <- glm(Y~X, family="gaussian")
summary(glm.unbiased)
## 
## Call:
## glm(formula = Y ~ X, family = "gaussian")
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.0003944  0.0031629  -0.125    0.901    
## X            0.6966664  0.0031746 219.447   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 1.000377)
## 
##     Null deviance: 148211  on 99999  degrees of freedom
## Residual deviance: 100036  on 99998  degrees of freedom
## AIC: 283829
## 
## Number of Fisher Scoring iterations: 2

2.2.9 Misspecified model with \(C\)

This is one of examples of selection bias. For example, let’s say, \(X\) is Education, \(Y\) is income, and \(C\) is social welfare program. People at lower education (i.e., high risk group in terms of exposure) and lower income (i.e., higher risk group in terms of outcome) are more likely to register social welfare program. If survey was conducted based on the registered social welfare program, the “estimated” association from this “disproportionally selected” respondents are likely biased.

glm.unbiased <- glm(Y~X + C, family="gaussian")
summary(glm.unbiased)
## 
## Call:
## glm(formula = Y ~ X + C, family = "gaussian")
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.0007675  0.0027135  -0.283 0.777298    
## X           -0.0164534  0.0046471  -3.541 0.000399 ***
## C            0.4414927  0.0023311 189.391  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.7362814)
## 
##     Null deviance: 148211  on 99999  degrees of freedom
## Residual deviance:  73626  on 99997  degrees of freedom
## AIC: 253178
## 
## Number of Fisher Scoring iterations: 2

2.3 Overadjustment bias

Please note that this is not a comprehensive example; only reflect one aspect of potential overadjustement bias.

Let’s assume a model with \(M\) as a mediator.

N <- 100000
X <- rnorm(N)
M <- .5 * X + rnorm(N)
Y <- .3 * X + .4 * M + rnorm(N)

2.4 Total effect

glm.unbiased <- glm(Y~X, family="gaussian")
summary(glm.unbiased)
## 
## Call:
## glm(formula = Y ~ X, family = "gaussian")
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.0006419  0.0034065  -0.188    0.851    
## X            0.5058361  0.0034073 148.456   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 1.160389)
## 
##     Null deviance: 141611  on 99999  degrees of freedom
## Residual deviance: 116037  on 99998  degrees of freedom
## AIC: 298667
## 
## Number of Fisher Scoring iterations: 2

2.5 Overadjustment

glm.unbiased <- glm(Y~X + M, family="gaussian")
summary(glm.unbiased)
## 
## Call:
## glm(formula = Y ~ X + M, family = "gaussian")
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.002150   0.003164  -0.679    0.497    
## X            0.306941   0.003536  86.816   <2e-16 ***
## M            0.398001   0.003154 126.179   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 1.001023)
## 
##     Null deviance: 141611  on 99999  degrees of freedom
## Residual deviance: 100099  on 99997  degrees of freedom
## AIC: 283895
## 
## Number of Fisher Scoring iterations: 2

2.6 Logistic models

2.6.1 Sex as a Confounder, \(C\)

MYY <- data.frame(Sex = "Male",
                  Smoking = "Yes",
                  Cancer = 1,
                  freq = 5
                  )

MYN <- data.frame(Sex = "Male",
                  Smoking = "Yes",
                  Cancer = 0,
                  freq = 8
                  )

MNY <- data.frame(Sex = "Male",
                  Smoking = "No",
                  Cancer = 1,
                  freq = 45
                  )

MNN <- data.frame(Sex = "Male",
                  Smoking = "No",
                  Cancer = 0,
                  freq = 72
                  )


FYY <- data.frame(Sex = "Female",
                  Smoking = "Yes",
                  Cancer = 1,
                  freq = 25
                  )

FYN <- data.frame(Sex = "Female",
                  Smoking = "Yes",
                  Cancer = 0,
                  freq = 10
                  )

FNY <- data.frame(Sex = "Female",
                  Smoking = "No",
                  Cancer = 1,
                  freq = 25
                  )

FNN <- data.frame(Sex = "Female",
                  Smoking = "No",
                  Cancer = 0,
                  freq = 10
                  )

Ex_confounder <- rbind(MYY, MYN, MNY, MNN, FYY, FYN, FNY, FNN)

Convert Freq table to raw data

library(tidyr)
raw_confounder <- Ex_confounder %>% 
  uncount(freq)
glm.unbiased <- glm(Cancer ~ Smoking , family=binomial(link = "logit"), data=raw_confounder)
summary(glm.unbiased)
## 
## Call:
## glm(formula = Cancer ~ Smoking, family = binomial(link = "logit"), 
##     data = raw_confounder)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept)  -0.1582     0.1627  -0.972   0.3309  
## SmokingYes    0.6690     0.3397   1.970   0.0489 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 277.26  on 199  degrees of freedom
## Residual deviance: 273.28  on 198  degrees of freedom
## AIC: 277.28
## 
## Number of Fisher Scoring iterations: 4
  • Full model:
glm_logit <- glm(Cancer ~ Smoking + Sex , family=binomial(link = "logit"), data=raw_confounder)
glm_logit
## 
## Call:  glm(formula = Cancer ~ Smoking + Sex, family = binomial(link = "logit"), 
##     data = raw_confounder)
## 
## Coefficients:
## (Intercept)   SmokingYes      SexMale  
##   9.163e-01    4.266e-15   -1.386e+00  
## 
## Degrees of Freedom: 199 Total (i.e. Null);  197 Residual
## Null Deviance:       277.3 
## Residual Deviance: 257   AIC: 263
  • Stratified models
## For males
raw_confounder_M <- raw_confounder[ which(raw_confounder$Sex=='Male'), ]
glm_logit_m <- glm(Cancer ~ Smoking , family=binomial(link = "logit"), data=raw_confounder_M)
glm_logit_m
## 
## Call:  glm(formula = Cancer ~ Smoking, family = binomial(link = "logit"), 
##     data = raw_confounder_M)
## 
## Coefficients:
## (Intercept)   SmokingYes  
##  -4.700e-01    6.672e-16  
## 
## Degrees of Freedom: 129 Total (i.e. Null);  128 Residual
## Null Deviance:       173.2 
## Residual Deviance: 173.2     AIC: 177.2
# For females
raw_confounder_F <- raw_confounder[ which(raw_confounder$Sex=='Female'), ]
glm_logit_f <- glm(Cancer ~ Smoking , family=binomial(link = "logit"), data=raw_confounder_F)
glm_logit_f
## 
## Call:  glm(formula = Cancer ~ Smoking, family = binomial(link = "logit"), 
##     data = raw_confounder_F)
## 
## Coefficients:
## (Intercept)   SmokingYes  
##   9.163e-01    9.400e-16  
## 
## Degrees of Freedom: 69 Total (i.e. Null);  68 Residual
## Null Deviance:       83.76 
## Residual Deviance: 83.76     AIC: 87.76

2.6.2 Sex as a Moderator, \(M\)

MYY <- data.frame(Sex = "Male",
                  Smoking = "Yes",
                  Cancer = 1,
                  freq = 5
                  )

MYN <- data.frame(Sex = "Male",
                  Smoking = "Yes",
                  Cancer = 0,
                  freq = 4
                  )

MNY <- data.frame(Sex = "Male",
                  Smoking = "No",
                  Cancer = 1,
                  freq = 45
                  )

MNN <- data.frame(Sex = "Male",
                  Smoking = "No",
                  Cancer = 0,
                  freq = 68
                  )


FYY <- data.frame(Sex = "Female",
                  Smoking = "Yes",
                  Cancer = 1,
                  freq = 25
                  )

FYN <- data.frame(Sex = "Female",
                  Smoking = "Yes",
                  Cancer = 0,
                  freq = 14
                  )

FNY <- data.frame(Sex = "Female",
                  Smoking = "No",
                  Cancer = 1,
                  freq = 25
                  )

FNN <- data.frame(Sex = "Female",
                  Smoking = "No",
                  Cancer = 0,
                  freq = 14
                  )

Ex_moderator <- rbind(MYY, MYN, MNY, MNN, FYY, FYN, FNY, FNN)

Convert Freq table to raw data

library(tidyr)
raw_moderator <- Ex_moderator %>% 
  uncount(freq)
  • Full model:
glm_logit <- glm(Cancer ~ Smoking , family=binomial(link = "logit"), data=raw_moderator)
glm_logit
## 
## Call:  glm(formula = Cancer ~ Smoking, family = binomial(link = "logit"), 
##     data = raw_moderator)
## 
## Coefficients:
## (Intercept)   SmokingYes  
##     -0.1582       0.6690  
## 
## Degrees of Freedom: 199 Total (i.e. Null);  198 Residual
## Null Deviance:       277.3 
## Residual Deviance: 273.3     AIC: 277.3
  • Stratified models
## For males
raw_moderator_M <- raw_moderator[ which(raw_moderator$Sex=='Male'), ]
glm_logit_m <- glm(Cancer ~ Smoking , family=binomial(link = "logit"), data=raw_moderator_M)
glm_logit_m
## 
## Call:  glm(formula = Cancer ~ Smoking, family = binomial(link = "logit"), 
##     data = raw_moderator_M)
## 
## Coefficients:
## (Intercept)   SmokingYes  
##     -0.4128       0.6360  
## 
## Degrees of Freedom: 121 Total (i.e. Null);  120 Residual
## Null Deviance:       165.1 
## Residual Deviance: 164.3     AIC: 168.3
# For females
raw_moderator_F <- raw_moderator[ which(raw_moderator$Sex=='Female'), ]
glm_logit_f <- glm(Cancer ~ Smoking , family=binomial(link = "logit"), data=raw_moderator_F)
glm_logit_f
## 
## Call:  glm(formula = Cancer ~ Smoking, family = binomial(link = "logit"), 
##     data = raw_moderator_F)
## 
## Coefficients:
## (Intercept)   SmokingYes  
##   5.798e-01   -2.621e-16  
## 
## Degrees of Freedom: 77 Total (i.e. Null);  76 Residual
## Null Deviance:       101.8 
## Residual Deviance: 101.8     AIC: 105.8